2021-03-21

The Melbourne Housing Snapshot Dataset

  • Home Sales in 2017
    • Location
    • Construction
    • Sale
  • Variables: 21
    • Numeric: 12
    • Categorical: 9

The Variables

Rooms: Number of rooms

Price: Sale price (AUD)

Method: Method of sale - 5 categories

Type: House, Unit, Townhouse - 3 categories

SellerG: Real Estate Agent - 268 categories

Date: Date sold

Distance: Distance from Central Business District

Regionname: Region name - 8 categories

Propertycount: Number of properties that exist in the suburb

More Variables

Bedroom2: Number of Bedrooms

Bathroom: Number of Bathrooms

Car: Number of car spaces

Landsize: Land Size

BuildingArea: Building Size

YearBuilt: Year home built

CouncilArea: Governing council for the area - 34 categories

Lattitude, Longtitude: GPS location

Suburb: Suburb name - 314 categories

Goals

  1. Understand which attributes of a home and its sale determine final sale price
  2. Attempt to build a reasonable model for inference and/or prediction for final sale price

Summary of Price Statistics

Mean: $1,075,684

SD: $639,311

data.full$Price
Min 85000
Q1 650000
Median 903000
Mean 1075684
Q3 1330000
Max 9000000
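
The figures above have the shape of base R's `summary()` and `sd()` output. A minimal sketch of how they could be produced, assuming the data sit in a data frame named `data.full` as shown above. The six prices below are made-up placeholders, not values from the dataset:

```r
# Hypothetical stand-in for the real data frame; values are illustrative only.
data.full <- data.frame(
  Price = c(420000, 650000, 903000, 1250000, 1800000, 3100000)
)

# Five-number summary plus the mean (Min, Q1, Median, Mean, Q3, Max).
price.summary <- summary(data.full$Price)

# Sample standard deviation (n - 1 in the denominator).
price.sd <- sd(data.full$Price)

print(price.summary)
print(price.sd)
```

A mean well above the median, as in the real data, points to a right-skewed price distribution.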

Select Data Pairs

Correlations

Map of Melbourne Sales

Selling Price

Log Selling Price

Price by Region

Price by Number of Rooms (<10 Rooms)

Price by Type of Home

Test of Independence by Group (Pearson \(\chi^2\))

Type, Rooms, Regionname, SellerG

\(H_0\): All means equal by group

All four tests reject \(H_0\) with p-value \(< 2\times 10^{-16}\)

Price by Region and Type

First Attempt at Linear Model

Remove the Variable with Highest VIF
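
The variance inflation factor for predictor \(j\) is \(1/(1-R_j^2)\), where \(R_j^2\) comes from regressing that predictor on all the others. The sketch below computes it with base R only and drops the worst offender. The data are simulated, and the helper `vif_manual` and the data frame `homes` are hypothetical names; the variables actually removed in this analysis are not shown here.

```r
set.seed(1)
n <- 200

# Simulated predictors (hypothetical). Bedroom2 is built to track Rooms closely,
# so the two are strongly collinear.
Rooms    <- sample(1:6, n, replace = TRUE)
Bedroom2 <- Rooms + sample(c(-1, 0, 0, 0), n, replace = TRUE)
Distance <- runif(n, 1, 40)
Price    <- 2e5 * Rooms - 1e4 * Distance + rnorm(n, sd = 1e5)
homes    <- data.frame(Price, Rooms, Bedroom2, Distance)

# VIF_j = 1 / (1 - R^2_j), with R^2_j from regressing predictor j on the rest.
vif_manual <- function(data, predictors) {
  sapply(predictors, function(p) {
    others <- setdiff(predictors, p)
    r2 <- summary(lm(reformulate(others, response = p), data = data))$r.squared
    1 / (1 - r2)
  })
}

predictors <- c("Rooms", "Bedroom2", "Distance")
vifs <- vif_manual(homes, predictors)
print(round(vifs, 2))

# Drop the predictor with the highest VIF and refit.
worst   <- names(which.max(vifs))
reduced <- setdiff(predictors, worst)
fit     <- lm(reformulate(reduced, response = "Price"), data = homes)
```

With factor predictors a generalized VIF is usually needed instead, which this numeric-only sketch does not cover.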

Model Coefficients

Consider Interactions

Remove Land Size

Model 4 Coefficients

Residual Analysis - Homogeneity? No

Residual Analysis - Normal? Not Quite

Residual Analysis - Influence? Yes

Remove Influence Points
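
One common way to flag influential observations is Cook's distance with a rule-of-thumb cutoff of \(4/n\). Whether that cutoff was used here is an assumption; the sketch below applies it to simulated data with one planted outlier.

```r
set.seed(2)
n <- 100

# Simulated data with true slope 2 (hypothetical).
x <- runif(n, 0, 10)
y <- 3 + 2 * x + rnorm(n)

# Plant one high-leverage point far from the trend.
x[n] <- 30
y[n] <- 0
d <- data.frame(x, y)

fit <- lm(y ~ x, data = d)

# Cook's distance measures how much the fit changes when a point is deleted.
cd   <- cooks.distance(fit)
keep <- cd <= 4 / n   # rule-of-thumb cutoff (an assumption, not a fixed standard)

# Refit without the flagged points.
refit <- lm(y ~ x, data = d[keep, ])

print(coef(fit))
print(coef(refit))
```

Removing points changes the population the model describes, so the dropped rows are worth inspecting before they are discarded.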

Proposed Model

Testing \(R^2\)

\[ R^2 = 1 - \dfrac{RSS}{TSS} = 0.441 \]
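
A minimal sketch of computing \(R^2 = 1 - RSS/TSS\) on held-out data, assuming "testing" refers to a train/test split. The data are simulated and the 70/30 split is an illustrative choice, not the one used for the figure above.

```r
set.seed(3)
n <- 300

# Simulated data (hypothetical).
x <- runif(n, 0, 10)
y <- 1 + 0.5 * x + rnorm(n, sd = 2)
d <- data.frame(x, y)

# Hold out 30% of the rows for testing.
test.idx <- sample(n, size = round(0.3 * n))
train <- d[-test.idx, ]
test  <- d[test.idx, ]

fit  <- lm(y ~ x, data = train)
pred <- predict(fit, newdata = test)

# RSS: squared prediction error on the test rows.
# TSS: squared deviation of the test responses from their own mean.
rss <- sum((test$y - pred)^2)
tss <- sum((test$y - mean(test$y))^2)
r2.test <- 1 - rss / tss

print(r2.test)
```

Unlike the in-sample value, this quantity can be negative when the model predicts worse than the test-set mean.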

Transform Data - Homogeneity

Transform Data - Normality

Future Work

  • Further explore log transformation
  • Consider GLM with log link
  • How to handle factors with many levels (hundreds)?
  • Missing data
  • Improve Prediction
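
The first two items differ in what they model: a linear model on `log(price)` describes the mean of the log price, while a GLM with a log link describes the log of the mean price. The sketch below fits both to simulated data. The Gamma family is an assumption chosen because prices are positive and right-skewed; no family is specified above.

```r
set.seed(4)
n <- 500

# Simulated positive, right-skewed prices (hypothetical).
rooms <- sample(1:6, n, replace = TRUE)
price <- exp(12.5 + 0.25 * rooms + rnorm(n, sd = 0.4))
d <- data.frame(price, rooms)

# Option 1: log-transform the response, then ordinary least squares.
# Models E[log(price)].
fit.loglm <- lm(log(price) ~ rooms, data = d)

# Option 2: GLM with a log link on the untransformed response.
# Models log(E[price]); Gamma family is an illustrative choice.
fit.glm <- glm(price ~ rooms, family = Gamma(link = "log"), data = d)

print(coef(fit.loglm))
print(coef(fit.glm))
```

Predictions from the GLM are already on the price scale via `predict(fit.glm, type = "response")`, whereas exponentiating predictions from the log-linear model estimates a median-like quantity rather than the mean.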